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Abstract 

The block coordinate descent (BCD) method is widely used for minimizing a continuous function / 
of several block variables. At each iteration of this method, a single block of variables is optimized, while 
the remaining variables are held fixed. To ensure the convergence of the BCD method, the subproblem 
to be optimized in each iteration needs to be solved exactly to its unique optimal solution. Unfortunately, 
these requirements are often too restrictive for many practical scenarios. In this paper, we study an 
alternative inexact BCD approach which updates the variable blocks by successively minimizing a 
sequence of approximations of / which are either locally tight upper bounds of / or strictly convex 
local approximations of /. We focus on characterizing the convergence properties for a fairly wide class 
of such methods, especially for the cases where the objective functions are either non-differentiable or 
nonconvex. Our results unify and extend the existing convergence results for many classical algorithms 
such as the BCD method, the difference of convex functions (DC) method, the expectation maximization 
(EM) algorithm, as well as the alternating proximal minimization algorithm. 

Index Terms 

Block Coordinate Descent, Block Successive Upper-bound Minimization, Successive Convex Ap- 
proximation, Successive Inner Approximation 



The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, 200 Union Street SE, 
Minneapolis, MN 55455. Emails: {meisam,mhong,luozq}@ece.umn.edu. 



September 12, 2012 



DRAFT 



2 



I. Introduction 
Consider the following optimization problem 

min f(xi,...,x n ) 
s.t. Xi S Xi, i = 1, 2, . . . , n, 

where Xi C ]R m * is a closed convex set, and / : Yli=i %i — > K is a continuous function. A popular 
approach for solving the above optimization problem is the block coordinate descent method (BCD), 
which is also known as the Gauss-Seidel method. At each iteration of this method, the function is 
minimized with respect to a single block of variables while the rest of the blocks are held fixed. More 
specifically, at iteration r of the algorithm, the block variable Xi is updated by solving the following 
subproblem 

x[ = argmin f{x\, . . . , y { , xl+l, . . . , i = l,2,...,n. (1) 

Let us use {x r } to denote the sequence of iterates generated by this algorithm, where x r = (x\, . . . , x r n ). 
Due to its particular simple implementation, the BCD method has been widely used for solving problems 
such as power allocation in wireless communication systems ll28l . clustering |[T4l . image denoising and 
image reconstruction and dynamic programming ifTTl . 

Convergence of the BCD method typically requires the uniqueness of the minimizer at each step or the 
quasi-convexity of the objective function (see ||30l and the references therein). Without these assumptions, 
it is possible that the BCD iterates do not get close to any of the stationary points of the problem (see 
Powell ll24l for examples). Unfortunately, these requirements can be quite restrictive in some important 
practical problems such the tensor decomposition problem (see |[T9l and the application section in this 
work) and the sum rate maximization problem in wireless networks. In fact, for the latter case, even 
solving the per block subproblem (fl} is difficult due to the non-convexity and non-differentiability of the 
objective function. 

To overcome such difficulties, one can modify the BCD algorithm by optimizing a well-chosen 
approximate version of the objective function at each iteration. The classical gradient descent method, 
for example, can be viewed as an implementation of such strategy. To illustrate, recall that the update 
rule of the gradient descent method is given by 

x r+l =x r _ a r+l V f( x ry 
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This update rule is equivalent to solving the following problem 

x r+1 = argmin g(x,x r ), 

X 

where 

g(x, x r ) 4 + Vf(x r )(x - x r ) + ^+r\\x ~ x r \\ 2 . 

Clearly, the function g(x,x r ) is an approximation of /(■) around the point x r . In fact, as we will see 
later in this paper, successively optimizing an approximate version of the original objective is the key 
idea of many important algorithms such as the concave-convex procedure ll33l . the EM algorithm ifTOl . 
the proximal minimization algorithm O, to name a few. Furthermore, this idea can be used to simplify 
the computation and to guarantee the convergence of the original BCD algorithm with the Gauss-Seidel 
update rule (e.g. 011 , |[T2l . (32J). However, despite its wide applicability, there appears to be no general 
unifying convergence analysis for this class of algorithms. 

In this paper, we provide a unified convergence analysis for a general class of inexact BCD methods 
in which a sequence of approximate versions of the original problem are solved successively. Our focus 
will be on problems with nonsmooth and nonconvex objective functions. Two types of approximations 
are considered: one being a locally tight upper bound for the original objective function, the other being 
a convex local approximation of the objective function. We provide convergence analysis for both of 
these successive approximation strategies as well as for various types of updating rules, including the 
cyclic updating rule, the Gauss-Southwell update rule or the overlapping essentially cyclic update rule. 
By allowing inexact solution of subproblems, our work unifies and extends several existing algorithms 
and their convergence analysis, including the difference of convex functions (DC) method, the expectation 
maximization (EM) algorithm, as well as the alternating proximal minimization algorithm. 

II. Technical Preliminaries 

Throughout the paper, we adopt the following notations. We use M m to denote the space of m 
dimensional real valued vectors, which is also represented as the Cartesian product of n smaller real 
valued vector spaces, i.e., 

R m = R m 1 xM m2 x ... xl ra ", 

where Y17=i mi = m - ^ e use tne notation (0, . . . , dk, ■ ■ ■ , 0) to denote the vector of all zeros except the 
k-th block, with df. G W mh . The following concepts/definitions are adopted in our paper: 
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• Distance of a point from a set: Let S C R m be a set and x be a point in R m , the distance of the 
point x from the set S is defined as 

d(x, S) = inf \\x — s\\, 

where || • || denotes the 2-norm in R m . 

• Directional derivative: Let / : V — > R be a function where V C R m is a convex set. The directional 
derivative of / at point x in direction d is defined by 

f>(x;d)±lhnini f{x + Xd) - f{x) . 
J v ; A^O A 

• Stationary points of a function: Let / : V — > R be a function where 2? C R m is a convex set. 
The point x is a stationary point of /(•) if f'(x;d) > for all d such that x + d G P. In this paper 
we use the notation Af* to denote the set of stationary points of a function. 

• Regularity of a function at a point: The function / : R m — > R is regular at the point z e dom/ 
with respect to the coordinates mi, ?tt,2, . . . , m n , mi + m-2 + . . . + m n = m, if /'(z; d) > for 
all d = (d 1 ,d 2 ,.. .,d n ) with f(z;d° k ) > 0, where d° k = (0. . . , d k , . . . , 0) and d k G R mfc ,V fc. For 
detailed discussion on the regularity of a function, the readers are referred to ll30l Lemma 3.1]. 

• Quasi-convex function: The function / is quasi-convex if 

f(9x+(l-9)y) <max{/(x),/(y)}, V e € (0,1), Vx,yG dom/ 

• Coordinatewise minimum of a function: z G dom / C R m is coordinatewise minimum of / with 

respect to the coordinates in W"" 1 , K" 12 , . . . , U?" 1 " , mi + . . . + m^ = m if 

/(*+(0,...,d fc ,...,0))>/(z), V4eK"" with z + (0, . . . , dk, • • • , 0) G dom /. 

III. Successive Upper-bound Minimization (SUM) 

To gain some insights to the general inexact BCD method, let us first consider a simple Successive 
Upper-bound Minimization (SUM) approach in which all the variables are grouped into a single block. 
Although simple in form, the SUM algorithm is the key to many important algorithms such as the DC 
programming |33| and the EM algorithm (4l. 

Consider the following optimization problem 

min f(x) 

(2) 

s.t. x & X, 
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1 Find a feasible point x € X and set r = 

2 repeat 

3 r = r + 1 

4 Let <Y r = argmin xg ^ u(x, x r ~ 1 ) 

5 Set x r to be an arbitrary element in X r 

6 until some convergence criterion is met 



Fig. 1: Pseudo code of the SUM algorithm 



where X is a closed convex set. Without loss of generality, we can assume that dom f = X. When 
the objective function /(•) is non-convex and/or nonsmooth, solving (0 directly may not be easy. The 
SUM algorithm circumvents such difficulty by optimizing a sequence of approximate objective functions 
instead. More specifically, starting from a feasible point x°, the algorithm generates a sequence {x r } 
according to the following update rule 

x r G argmin u(x,x r ~ 1 ) (3) 

where x r ~ l is the point generated by the algorithm at iteration r — 1 and u(x, x r ~ l ) is an approximation 
of f(x) at the r-th iteration. Typically the approximate function u{-, •) needs to be chosen such that the 
subproblem ([3]) is easy to solve. Moreover, to ensure the convergence of the SUM algorithm, certain 
regularity conditions on u(-,-) is required (which will be discussed shortly). Among others, u{x,x r ~ l ) 
needs to be a global upper bound for f(x), hence the name of the algorithm. The main steps of the SUM 
algorithm are presented in Fig. Q] 

We remark that the proposed SUM algorithm is in many ways similar to the inner approximation 
algorithm (IAA) developed in II2T1 . with the following key differences: 

• The IAA algorithm approximates both the objective functions and the feasible sets. On the contrary, 
the SUM algorithm only approximates the objective function. 

• The the IAA algorithm is only applicable for problems with smooth objectives, while the SUM 
algorithm is able to handle nonsmooth objectives as well. 

It is worth mentioning that the existing convergence result for the IAA algorithm is quite weak. In 
particular, II2T1 Theorem 1] states that if the whole sequence converges, then the algorithm should converge 
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to a stationary point. In the following, we show that the SUM algorithm provides stronger convergence 
guarantees as long as the approximation function u(-,-) satisfies certain mild assumptions^] which we 
outline below. 

Assumption 1 Let the approximation function ■) satisfy the following 

u(y,y) = f(y), VyeX (Al) 

u(x,y)>f(x), Vx,yeX (A2) 
u'(x,y;d) 

u(x, y) is continuous in (x, y) (A4) 



= f'(y;d), Vd with y + d £ X (A3) 

x=y 



The assumptions (IA 1 b and (lA2b imply that the approximate function u(-, x r ~ l ) in (0) is a tight upper 
bound of the original function. The assumption (I A3 1 ) guarantees that the first order behavior of u(-, x r ~ 1 ) 
is the same as /(•) locally (note that the directional derivative u'(x,y;d) is only with respect to the 
variable x). Although directly checking (I A3 1 ) may not be easy, the following proposition provides a 
sufficient condition under which (I A3 1 ) holds true automatically. 

Proposition 1 Assume f(x) = fo(x) + fi(x), where /o(-) is continuously differentiate and the direc- 
tional derivative of /i(-) exists at every point x E X. Consider u(x, y) = uo(x, y) + fi(x), where uq(x, y) 
is a continuously differentiable function satisfying the following conditions 

Mv,y) = h(y), Vy^x (4) 

u {x,y) > f (x), Vx,y£X. (5) 

Then, (lATT ). dA2l and dA3]) hold for u(-, ■). 

Proof: First of all, (|4]) and ([5]) imply (IA 1 b and (IA2b immediately. Now we prove (IA3b by contradiction. 
Assume the contrary so that there exist a y G X and a d € M m so that 



/ u'(x,y;d) 

This further implies that 



and y + deX. (6) 



/o(y; d ) + u 'o(.x,y;d) 

'These assumptions are weaker than those made to ensure the convergence of the IAA algorithm. 
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Furthermore, since /o(-) and uo(-, •) are continuously differentiable, there exists a a > such that for 

z = y + ad, 

/oO;cf) / u' (x,z;d) 



(V) 



The assumptions (01) and §5§ imply that 



M (x,2;d) 



lim 



uq(z + Ad, z) — uq(z, z) 
A 



>lim /o(, + Ad)-/o(,) 
~ A4.0 A 

On the other hand, the differentiability of /o(-)> uo(-, •) and using (0]), © imply 

uq(^, 2) — uq(z — Ad, z) 



(8) 



u' (x,z;d) 



lim ■ 

A4,o 



A 



<lim /o(^)-/o(^-Ad) = 

~ A4.0 A v ; 



(9) 



Clearly, ([8]) and © imply that u (x,2:;d) 



f'(z;d) which contradicts (O. 



The following theorem establishes the convergence for the SUM algorithm. 

Theorem 1 Assume that Assumption \J\ is satisfied. Then every limit point of the iterates generated by 
the SUM algorithm is a stationary point of the problem (0. 



Proof: Firstly, we observe the following series of inequalities 



f(x r+1 ) < u(x r+ \x r ) < u(x r , x r ) = f(x r ), V r = 0, 1, 2, . . . 



(10) 



where step (i) is due to (| A 1 b . step (ii) follows from the optimality of x t+1 (cf. step 4 and 5 in FigfTJ, 
and the last equality is due to (IA2b . A straightforward consequence of (ITTJb is that the sequence of the 
objective function values are non-increasing, that is 



fix^yfix^yfix 2 )^... 



(ii) 



Assume that there exists a subsequence {x Tj } converging to a limit point z. Then Assumptions (IA1I ). 
lA2l together with (TTTb imply that 



u{x r *+\x Tl+1 ) = f(x r ^) < f(x r * +1 ) < u(x r i +1 ,x r >) < u(x,x r '), V x G X 
Letting j — > 00, we obtain 

u(z, z) < u(x, z), V x G X, 
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which implies 

u'(x,z;d) > 0, VdG R m with z + d£ X. 

x=z 

Combining with dA3l ), we obtain 

f'{z; d) > 0, VdG M m with z + de^, 
implying that z is a stationary point of /(•). ■ 

Corollary 1 Assume that the level set X° = {x | f(x) < f(x )} is compact and Assumption \T\ holds. 
Then, the sequence of iterates {x r } generated by the SUM algorithm satisfy 

lim d{x r ,X*)=0, 

where X* is the set of stationary points of ©. 

Proof: We prove the claim by contradiction. Suppose on the contrary that there exists a subsequence 
{x rj } such that d(x rj , X*) > 7 for some 7 > 0. Since the sequence {x T] } lies in the compact set X°, 
it has a limit point z. By further restricting the indices of the subsequence, we obtain 

d(z,X*) = lim d{x r \X*) > 7, 

j->oo 

which contradicts the fact that z G X* due to Theorem Q] ■ 
The above results show that under Assumption [H the SUM algorithm is globally convergent. In the 
rest of this work, we derive similar results for a family of more general inexact BCD algorithms. 

IV. The Block Successive Upper-bound Minimization Algorithm 

In many practical applications, the optimization variables can be decomposed into independent blocks. 
Such block structure, when judiciously exploited, can lead to low-complexity algorithms that are dis- 
tributedly implementable. In this section, we introduce the Block Successive Upper-bound Minimization 
(BSUM) algorithm, which effectively takes such block structure into consideration. 

Let us assume that the feasible set X is the cartesian product of n closed convex sets: X = X\ x . . .xX n , 
with Xi C R mi and £V rrii = m. Accordingly, the optimization variable x £ M m can be decomposed as: 
x = (xi,X2, . . . , x n ), with Xi G Xi, i = 1, • • • , M. We are interested in solving the problem 

min f(x) 

(12) 

s.t. x £ X. 
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1 Find a feasible point x € X and set r = 

2 repeat 

3 r = r + 1, i = (r mod n) + 1 

4 Let X r = argmin^g,^ u i (x i ,x r ~ 1 ) 

5 Set x\ to be an arbitrary element in X r 

6 Set x\ = 4 _1 , Vfc/f 

7 until some convergence criterion is met 



Fig. 2: Pseudo code of the BSUM algorithm 



Different from the SUM algorithm, the BSUM algorithm only updates a single block of variables in 
each iteration. More precisely, at iteration r, the selected block (say block i) is computed by solving the 
following subproblem 

min Ui(x{, x r ~ 1 ) 

(13) 

S . t . X % £ X% ; 

where x r ~ l ) is again an approximation (in fact, a global upper-bound) of the original objective /(•) 
at the point x r ~ 1 . Fig. |2] summarizes the main steps of the BSUM algorithm. Note that although the 
blocks are updated following a simple cyclic rule, the algorithm and its convergence results can be easily 
extended to the (more general) essentially cyclic update rule as well. This point will be further elaborated 
in Section IVlTl 

Now we are ready to study the convergence behavior of the BSUM algorithm. To this end, the following 
regularity conditions on the function •) are needed. 

Assumption 2 

Ui{yi,y) = f(y), VyeX,Vi (Bi) 

Ui(xi,y) > f(yi,...,yi-i,Xi,y i+ x,...,y n ), V x { G A^V y G X,V i (B2) 



u'i(xi,y,di) 



f'(y;d), Vd= (0,...,di,...,0) s.t. Vi + di £ X h V i (B3) 



Xi=yi 



Ui(xi, y) is continuous in (xi,y), V i (B4) 



September 12, 2012 



DRAFT 



10 



Similar to Proposition [JJ we can identify a sufficient condition to ensure (1B31 ). 

Proposition 2 Assume f(x) = fo(x) + fi(x), where fo(-) is continuously differentiable and the direc- 
tional derivative of exists at every point x G X. Consider Ui(xi,y) = u 0i (xi,y) + f\(x), where 
uo t i (xi , y) satisfies the following assumptions 

u ,i(xi,x) = fo(x), VieA', Vi 

u ,i(xi,y) > f (yi, . . . ,yi-i,Xi,y i+1 , . . . ,y n ), V x,y G X VI 

77jen, (IBTT) . (lB2l . and dB3j) hold. 

Proof: The proof is exactly the same as the proof in Proposition [TJ ■ 
The convergence results regarding to the BSUM algorithm consist of two parts. In the first part, a 
quasi-convexity of the objective function is assumed, which guarantees the existence of the limit points. 
This is in the same spirit of the classical proof of convergence for the BCD method in [2:]. However, 
if we know that the iterates lie in a compact set, then a stronger result can be proved. Indeed, in the 
second part of the theorem, the convergence is obtained by relaxing the quasi-convexity assumption while 
imposing the compactness assumption of level sets. 

Theorem 2 

(a) Suppose that the function Ui(xi,y) is quasi-convex in x« and Assumption [2] holds. Furthermore, 
assume that the subproblem (1131 ) has a unique solution for any point x r ~ 1 G X. Then, every limit 
point z of the iterates generated by the BSUM algorithm is a coordinatewise minimum of (112b . In 
addition, if /(•) is regular at z, then z is a stationary point of (1121 ). 

(b) Suppose the level set X° = {x | f(x) < f(x )} is compact and As sumption^ holds. Furthermore, 
assume that the subproblem (113b has a unique solution for any point x r ~ 1 G X for at least n — 1 
blocks. If /(•) is regular at every point in the set of stationary points X* with respect to the 
coordinates xi, . . . , x n . Then, the iterates generated by the BSUM algorithm converge to the set of 
stationary points, i.e., 

lim d(x r ,X*)=0. 

r— >oo 

Proof: The proof of part (a) is similar to the one in (2) for block coordinate descent approach. First 
of all, since a locally tight upper bound of /(•) is minimized at each iteration, we have 

fix^yfix^yfix 2 )^.... (14) 
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Therefore, the continuity of /(•) implies 

lim f(x r ) = f(z). (15) 

r— >oo 

Let us consider the subsequence {x T] } converging to the limit point z. Since the number of blocks 
is finite, there exists a block which is updated infinitely often in the subsequence {rj}. Without loss of 
generality, we assume that block n is updated infinitely often. Thus, by further restricting to a subsequence, 
we can write 

x^=argmin u n {x n ,x r '~ 1 ). 

Now we prove that x T ' 3+1 — > z, in other words, we will show that — > Z\. Assume the contrary that 
x{ does not converge to z\. Therefore by further restricting to a subsequence, there exists 7 > such 



that 



7 < 7 = |pi — %{ ||, V rj. 



Let us normalize the difference between x r { and x^ J+1 , i.e., 



r,+l r 
z{ — x l 

7 r j 



Notice that \\s T] \\ = 1, thus s Tj belongs to a compact set and it has a limit point s. By further restricting 
to a subsequence that converges to s, using (|B1I) and (|B2|) . we obtain 

/(s r ' +1 )<«i(z? + V') (16) 

= ui(a£ + 7 r V,a: r ') (17) 

< ui(a£ + e^,/0, V e G [0, 1] (18) 

<ni(^,^) (19) 

= /(x r 0, (20) 



where ([161) and (EPJ) hold due to (|BTb and (|E2l . The inequalities ([18]) and ([191) are the result of quasi- 
convexity of u(-,x T: >). Letting j — > 00 and combining (fT6l ). (fT8l) . (031 ). and (l20l) imply 

/(*) < + £7S, ^) < f(z), V e G [0,1], 

or equivalently 

f(z) = u 1 (z 1 + er/s,z), Ve€[0,l]. (21) 
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Furthermore, 

ui(x? +1 ,a; r ' +1 ) = f(x r i +1 ) < f(x r > +1 ) 

< ui(x r { +l ,x T3 ) < ui(xi,x r] ), Vzi G A4. 

Letting j — y 00, we obtain 

^1(^1, -z) < z), V xi G ^i, 

which further implies that z\ is the minimizer of u\{-, z). On the other hand, we assume that the minimizer 
is unique, which contradicts (|2TT) . Therefore, the contrary assumption is not true, i.e., x r ' 3+1 —> z. 

Since = argmin :Cl6 ^' 1 u\(xi,x r] ), we get 

ui(x r 1 3+1 ,x r ' 3 ) < ui{xi,x r ' ] ) Vxi G X\. 
Taking the limit j — > 00 implies 

ui(z±, z) < u\(x\, z) V X\ G X\, 

which further implies 

v^{x!,z;di) > 0, V di G M™ 1 with ^ + di G Afi. 

Similarly, by repeating the above argument for the other blocks, we obtain 

u' k (xk, z; dk) > 0, V dk G M m,c with d& + Zfc G A^, V A; = 1, . . . , n. (22) 

Combining dB3l ) and (1221 implies 

/'(z;d)>0, Vd= (0, ...,4,...,0) s.t. d + z£X, Vfc 
in other words, z is the coordinatewise minimum of /(•). 

Now we prove part (b) of the theorem. Without loss of generality, let us assume that ([131 ) has a unique 
solution at every point x r ~ 1 for i = 1, 2, . . . , n — 1. Since the iterates lie in a compact set, we only need 
to show that every limit point of the iterates is a stationary point of /(•). To do so, let us consider a 
subsequence {x 7 " 1 } which converges to a limit point z G X° C X. Since the number of blocks is finite, 
there exists a block i which is updated infinitely often in the subsequence {x T] }. By further restricting 
to a subsequence, we can assume that 

x T : ] G argmin uAxi, x rs ~ ). 

Xi 
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Since all the iterates lie in a compact set, we can further restrict to a subsequence such that 

lim x r >- i+k = z k , Vfc = 0,l,...,n, 

j->oo 

where z k £ X° C and = z. Moreover, due to the update rule in the algorithm, we have 

u k (x r k i - i+k ,x r ^ i+k - 1 ) <u fc (: Cfc , a ; r '- i +*- 1 ), V x fc G A; = 1,2, ... ,n. 

Taking the limit j — )■ oo, we obtain 

u k (z^,z k - 1 ) < u k (x k ,z k - 1 ), Vx k eX k , k = l,2,...,n. (23) 

Combining (f23]>, (EB and <|B2]> implies 

f{z k ) < ukiziz*- 1 ) < u^- 1 ^"- 1 ) = /Ce*" 1 ), fc = 1, . . . ,n. (24) 

On the other hand, the objective function is non-increasing in the algorithm and it has a limit. Thus, 

f(z°) = f(z 1 ) = ... = f(z n ). (25) 

Using (|24l . d25l . and (|23l . we obtain 

f(z)=u k (z k k ,z k - 1 )<u k (x k ,z k - 1 ), V Ifc e4 A; = 1,2,..., n. (26) 

Furthermore, /(z) = f(z k " 1 ) = u k (z k ~ 1 , z k ~ 1 ) and therefore, 

u k {z k r l ,z k - 1 )<u k {x k ,z k - l ),V x k £X k , k = l,2,...,n. (27) 

The inequalities (l26l and (|2VT > imply that z^ 1 and z£ are both the minimizer of u k (-, z k ~ l ). However, 
according to our assumption, the minimizer is unique for k = 1, 2, . . . ,n — 1 and therefore, 

z = z l = z 2 = ... =z n-l = z 

Plugging the above relation in (1231 implies 

u k (z k ,z) < u k (x k ,z), V x k £ X k , k = 1,2, ... ,n- 1. (28) 
Moreover, by setting /c = n in (|27T ). we obtain 

u n (z n , z) < u n (x n , z), V x n £ X n . (29) 
The inequalities (1281 and d29l imply that 

> 0, y d k £ R mfc with z k + d k £ X k , fc = l,2,...,n. 



u' fe (x fc ,z;4) 
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Combining this with (lB3b yields 

f'(z;d)>0, V d = (0, . . . ,d k , . . . ,0) with z k + d k 6 X k , k = l,2,...,n, 

which implies the stationarity of the point z due to the regularity of /(•). ■ 
The above result extends the existing result of block coordinate descent method and IT301 to the 
BSUM case where only an approximation of the objective function is minimized at each iteration. As 
we will see in Section I VIII I our result implies the global convergence of several existing algorithms 
including the EM algorithm or the DC method when the Gauss-Seidel update rule is used. 

V. The Maximum Improvement Successive Upper-bound Minimization Algorithm 

A key assumption for the BSUM algorithm is the uniqueness of the minimizer of the subproblem. This 
assumption is necessary even for the simple BCD method 0. In general, by removing such assumption, 
the convergence is not guaranteed (see |24| for examples) unless we assume pseudo convexity in pairs 
of the variables IT341 . ll30l . In this section, we explore the possibility of removing such uniqueness 
assumption. 

Recently, Chen et al. CQ have proposed a related Maximum Block Improvement (MBI) algorithm, 
which differs from the conventional BCD algorithm only by its update schedule. More specifically, only 
the block that provides the maximum improvement is updated at each step. Remarkably, by utilizing 
such modified updating rule (which is similar to the well known Gauss-Southwell update rule), the 
per-block subproblems are allowed to have multiple solutions. Inspired by this recent development, we 
propose to modify the BSUM algorithm similarly by simply updating the block that gives the maximum 
improvement. We name the resulting algorithm the Maximum Improvement Successive Upper-bound 
Minimization (MISUM) algorithm, and list its main steps in Fig. [3] 

Clearly the MISUM algorithm is more general than the MBI method proposed in [fl]], since only an 
approximate version of the subproblem is solved at each iteration. Theorem [3] states the convergence 
result for the proposed MISUM algorithm. 

Theorem 3 Suppose that Assumption [2] is satisfied. Then, every limit point z of the iterates generated 
by the MISUM algorithm is a coordinatewise minimum of (I12I ). In addition, if /(•) is regular at z, then 
z is a stationary point of (1121 ). 
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Let ^ r = argmin^gA', 'Ufc(x fc ,x r ~ 1 ) 
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U <JL> h, WJ UL- til 1 til U1L1 tXl V 1 V_- 1 1 1 V_- 111 111 t\. 


1 


Setx[ = x[ _1 , Vi^k 


8 


until some convergence criterion is met 



Fig. 3: Pseudo code of the MISUM algorithm 
Proof: Let us define Ri{y) to be the minimum objective value of the i-th subproblem at a point y, 

i.e., 

Ri(y) = min Ui(xi,y). 

Xi 

Using a similar argument as in Theorem [2l we can show that the sequence of the objective function 
values are non-increasing, that is 

f(x r ) = u l (xlx r )>R l (x r )>f(x r+1 ). 

Let {x Tj } be the subsequence converging to a limit point z. For every fixed block index i = 1, 2, . . . ,rt 
and every Xi £ Xi, we have the following series of inequalities 

Ui(xi,x r >) > Ri(x r >) 

>u k (x k 3 ,X J ) 

where we use k to index the block that provides the maximum improvement at iteration rj + 1. The 
first and the second inequalities are due to the definition of the function Ri(-) and the MISUM update 
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rule, respectively. The third inequality is implied by the upper bound assumption (IB2I ). while the last 
inequality is due to the non-increasing property of the objective values. 
Letting j — > oo, we obtain 

Ui(xi,z) > Ui(zi,z), V Xi € Xi, i = l,2,...,n. 

The first order optimality condition implies 

u'i(xi, z; di) > 0, V di with Zi + di£Xi, V i = 1, 2, . . . , n. 

Xi=Zi 

Combining this with dB3b yields 

f'(z; d) > 0, V d = (0, . . . , di, . . . , 0) with z% + G Afj, z = 1, 2, . . . , n. 

In other words, z is the coordinatewise minimum of /(•). ■ 
The main advantage of the MISUM algorithm over the BSUM algorithm is that its convergence does 
not rely on the uniqueness of the minimizer for the subproblems. On the other hand, each iteration of 
MISUM algorithm is more expensive than the BSUM since the minimization needs to be performed for 
all the blocks. Nevertheless, the MISUM algorithm is more suitable when parallel processing units are 
available, since the minimizations with respect to all the blocks can be carried out simultaneously. 

VI. Successive Convex Approximation of a Smooth Function 

In the previous sections, we have demonstrated that the stationary solutions of the problems (O and (fT2l) 
can be obtained by successively minimizing a sequence of upper-bounds of /(•). However, in practice, 
unless the objective /(•) possesses certain convexity /concavity structure, those upper-bounds may not be 
easily identifiable. In this section, we extend the BSUM algorithm by further relaxing the requirement 
that the approximation functions {ui(xi,y)} must be the global upper-bounds of the original objective 
/• 

Throughout this section, we use hi{., .) to denote the convex approximation function for the ith block. 
Suppose that hi(xi,x) is no longer a global upper-bound of f(x), but only a first order approximation 
of f(x) at each point, i.e., 

hfyifXidi) = f'(x;d), V d = (0, . . . , di, . . . , 0) with x t + di 6 X^ (30) 

Vi=Xi 

In this case, simply optimizing the approximate functions in each step may not even decrease the objective 
function. Nevertheless, the minimizer obtained in each step can still be used to construct a good search 
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1 Find a feasible point x G X and set r = 

2 repeat 

3 r = r + 1, i = (r mod n) + 1 

4 Let X r = argmin^g^ hi{x i} x 7 '' 1 ) 

5 Set y\ to be an arbitrary element in X r and set y r k = x r k , V k ^ i 

6 Set d r = y r — x r and choose cr G (0, 1) 

7 Armijo stepsize rule: Choose a imt > and /3 S (0, 1). Let a r be the largest 
element in {a imt P^}j=o,i,... satisfying: 

f(x r ) - f(x r + a r d r ) > -aa r f'(x r ; d r ) 

8 Set x r = x r ~ x + a r (y r - x^ 1 ) 

9 until some convergence criterion is met 

Fig. 4: Pseudo code of the BSCA algorithm 



direction, which, when combined with a proper step size selection rule, can yield a sufficient decrease 
of the objective value. 

Suppose that at iteration r, the i-th block needs to be updated. Let y\ S min^.g^ hi{yi,x r ~ 1 ) denote 
the optimal solution for optimizing the i-th approximation function at the point x r . We propose to use 
y\ — x[~ l as the search direction, and adopt the Armijo rule to guide the step size selection process. We 
name the resulting algorithm the Block Successive Convex Approximation (BSCA) algorithm. Its main 
steps are given in Figure 01 

Note that for d r = (0, . . . , d[, . . . , 0) with d\ = y\ — x\, we have 

f(x';d-) = ti,{x„x';<t:) 



AJ,0 A 



where the inequality is due to the fact that hi(-) is convex and y\ = x r i +d\ is the minimizer at iteration r. 
Moreover, there holds 

f(x r ) - f(x r + ad r ) = -af'(x r ; d r ) + o(a), V a > 0. 

Hence the Armijo step size selection rule in Figure [4] is well defined when f'(x r ;d r ) / 0, and there 
exists j G {0, 1, . . .} such that for a r = a init /3 j , 

f(x r ) - f{x r + a r d r ) > -aa r f'{x r - d r ). (32) 
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The following theorem states the convergence result of the proposed algorithm. 

Theorem 4 Suppose that /(•) is continuously differentiable and that Assumption (1301 ) holds. Furthermore, 
assume that h(x, y) is strictly convex in x and continuous in (x, y). Then every limit point of the iterates 
generated by the BSCA algorithm is a stationary point of ©. 

Proof: First of all, due to the use of Armijo step size selection rule, we have 

f(x r )-f(x r+1 )>-aa r f'(x r ;d r )>0, 

which implies 



lim a r f'(x r ;d r ) = 0. (33) 

1 — >oo 



Consider a limit point z and a subsequence {x V] }j converging to z. Since {f(x r )} is a monotonically 
decreasing sequence, it follows that 

lim f(x r ) = f(z). 



By further restricting to a subsequence if necessary, we can assume without loss of generality that in 
the subsequence {x T] }j the first block is updated. We first claim that we can further restrict to a further 
subsequence such that 

lim cf' = 0. (34) 

We prove this by contradiction. Let us assume the contrary so that there exists 5, < S < 1 and 
£e{l,2,...} 

IK J || > S, Vj > I. (35) 

Defining p rj = A?Z n , the equation (l33l) implies c/ 3 \\d ri \\f'(x rj ;p rj ) — > 0. Thus, we have the following 
two cases: 

Case A: f'(x rj ;p Tl ) — > along a subsequence of {x Tj }. Let us restrict ourselves to that subsequence. 
Since \\p r3 \\ = 1, there exists a limit point p. By further restricting to a subsequence and using the 
smoothness of /(•), we obtain 

f'(z;p) = 0. (36) 
Furthermore, due to the strict convexity of hi(-,z), 

h 1 (z 1 + 5pi,z) > h 1 (z 1 ,z) + 5h' 1 (x 1 ,z;p 1 ) >h 1 (z 1 ,z), (37) 

Zl=2i 
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where p\ is the first block of p and the last step is due to (l36l ) and (l30l) . On the other hand, since 
x{ + lies between x{ and y[ 3 , we have 

Letting j — ► oo along the subsequence, we obtain 

/ii (zi + 5pi, z) < h\ (zi , z) , (38) 

which contradicts (137T ). 

Case B: a 1 " 3 ' HcT* || — )■ along a subsequence. Let us restrict ourselves to that subsequence. Due to the 
contrary assumption (I33T ), 

lim a T] = 0, 

which further implies that there exists jo £ {1, 2, . . .} such that 

f(x r > + ^-cT') - /!>•'• ) > a'^-j'u-'- ul' ). V j > j . 

P P 

Rearranging the terms, we obtain 

^— ><T/V;p r '). vj<j . 

Letting j — ► oo along the subsequence that jf 3 — > p, we obtain 

f'(z;p)>af'(z;p), 

which implies f(z;p) > since a < 1. Therefore, using an argument similar to the previous case, (|37T) 
and (1381 ) hold, which is a contradiction. Thus, the assumption 051 ) must be false and the condition (|34l ) 
must hold. On the other hand, y\ J is the minimizer of hi(-, x T] ); thus, 

/»! (y? , ) < hx (xx , x r * ) , V xi € Xi . (39) 

Note that y\ 3 = x r { + (£{ . Combining d34l) and d39l) and letting j — > oo yield 

h\(zi, z) < h\(xi, z), x\ £ X\. 

The first order optimality condition and assumption ( f30l > imply 

/'(2f; d) >0, Vd= (di,0,...,0) with zi + diGAi. 

On the other hand, since d Tj — > 0, it follows that 

lim x Tj+1 = z. 

j-HX> 
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Therefore, by restricting ourselves to the subsequence that d Tj — > and repeating the above argument n 
times, we obtain 

f'(z;d)>0, V d = (0, . . . ,d k , . . . ,0) with z k + d k e X k ; k = 1, . . . , n. 

Using the regularity of /(•) at point z completes the proof. ■ 
We remark that the proposed BSC A method is related to the coordinate gradient descent method 
ll3Tll . in which a strictly convex second order approximation of the objective function is minimized at 
each iteration. It is important to note that the convergence results of these two algorithm do not imply 
each other. The BSCA algorithm, although more general in the sense that the approximation function 
could take the form of any strictly convex function that satisfies 001 ), only covers the case when the 
objective function is smooth. Nevertheless, the freedom provided by the BSCA to choose a more general 
approximation function allows one to better approximate the original function at each iteration. 

VII. Overlapping Essentially Cyclic Rule 

In both the BSUM and the BSCA algorithms considered in the previous sections, variable blocks are 
updated in a simple cyclic manner. In this section, we consider a very general block scheduling rule 
named the overlapping essentially cyclic rule and show they still ensure the convergence of the BSUM 
and the BSCA algorithms. 

In the so called overlapping essentially cyclic rule, at each iteration r, a group i? r of the variables is 
chosen to be updated where 

$ r C{l,2,...,n} and tf r / 0. 
Furthermore, we assume that the update rule is essentially cyclic with period T, i.e., 

T 

U^ +i = {l,2,...,n}, Vr. 

Notice that this update rule is more general than the essentially cyclic rule since the blocks are allowed 
to have overlaps. Using the overlapping essentially cyclic update rule, almost all the convergence results 
presented so far still hold. For example, the following corollary extends the convergence of BSUM to 
the overlapping essentially cyclic case. 

Corollary 2 

(a) Assume that the function Ui(xi,y) is quasi-convex in Xi and Assumption\2\is satisfied. Furthermore, 
assume that the overlapping essentially cyclic update rule is used and the subproblem (113b has a 
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unique solution for every block $ r . Then, every limit point z of the iterates generated by the BSUM 
algorithm is a coordinatewise minimum of (|12|) . In addition, if /(■) is regular at z with respect to 
the updated blocks, then z is a stationary point of (1121 ). 
(b) Assume the level set X° = {x \ f(x) < f(x )} is compact and Assumption\2\is satisfied. Further- 
more, assume that the overlapping essentially cyclic update rule is used and the subproblem d 1 31 ) 
has a unique solution for every block i? r . If /(•) is regular (with respect to the updated blocks) at 
every point in the set of stationary points X*, then the iterates generated by the BSUM algorithm 
converges to the set of stationary points, i.e., 

lim d{x r ,X*)=0. 

r— >oo 

Proof: The proof of both cases are similar to the proof of the BSUM algorithm with the simple 
cyclic update rule. Here we only present the proof for case (a). The proof of part (b) is similar. 

Let {x 7 " 3 } be a convergent subsequence whose limit is denoted by z. Consider every T updating 
cycle along the subsequence {x 1 " 3 }, namely, {(x 1 " 3 , x rj+1 , . . . ,x r ' j+T ~ 1 )}. Since the number of different 
subblocks ■d r is finite, there must exist a (fixed) T tuple of variable blocks, say (i?o>$i> • • • >$T-i)> that 
has been updated in infinitely many T updating cycles. By restricting to the corresponding subsequence 
of {x rj }, we have 

Tj+i+i _ ar g m j n u ^ (x#., x r3+% ), V i = 0, 1, 2, . . . , T — 1. 

The rest of the proof is the same as the proof of part (a) in Theorem |2l The only difference is that the 
steps of the proof need to be repeated for the blocks ($o, • • • , $t-i) instead of (1, . . . , n). ■ 
In the proof of Corollary |2j we first restrict ourselves to a fixed set of T variable blocks that have 
been updated in infinitely many consecutive T update cycles. Then, we use the same approach as in the 
proof of the convergence of cyclic update rule. Using the same technique, we can extend the results in 
Theorem [4] to the overlapping essentially cyclic update rule. More specifically, we have the following 
corollary. 

Corollary 3 Assume /(•) is smooth and the condition ( 1301 ) is satisfied. Furthermore, assume that h(x,y) 
is strictly convex in x and the overlapping essentially cyclic update rule is used in the BSCA algorithm. 
Then every limit point of the iterates generated by the BSCA algorithm is a stationary point of ©. 

Notice that the overlapping essentially cyclic rule is not applicable to the MISUM algorithm in which 
the update order of the variables is given by the amount of improvement. However, one can simply check 
that the proof of Theorem [3] still applies to the case when the blocks are allowed to have overlaps. 
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VIII. Applications 

In this section, we provide several applications of the algorithms proposed in the previous sections. 

A. Linear Transceiver Design in Cellular Networks 

Consider a J^-cell wireless network where each base station k serves a set Tf. of users (see Fig. [5] for 
an illustration). Let denote the i-th receiver in cell k. For simplicity, suppose that the users and the 
base stations are all equipped with N antennas. Let us define the set of all users as I = {i^ | 1 < k < 
K, i G Z/J. Let di k denote the number of data symbols transmitted simultaneously to user i^. 



TXl TX2 




RXl RX2 RX3 RX4 RX5 RX6 



Fig. 5: The cellular network model considered in Section IVlII-AI The solid lines represent the direct channels, 
while the dotted lines represent the interfering channels. 



When linear transceivers are used at the base stations and the users, user i&'s received signal vector, 
denoted as yi k G C^, can be written as 

h K h 

y ik = H,./,V,.s, + X Hi k fcV 4 s 4 + n '.v v < s < + n h , V ^ G Z, 

desired signal ^/ ' ■ ~ " 

intracell interference intercell interference plus noise 

where V^ G C Mxd, k is the linear transmit beamformer used by base station k for user if.; Sj fc G C dl * xl 
is user i^s data signal. The matrix Hi k j represents the channel from transmitter j to receiver i^, and 
denotes the complex additive white Gaussian noise with distribution CAA(0, af k I). User estimates the 
intended message using a linear beamformer Uj fc G C Mxdi fc: Sj fc = U^yj fc . 
Treating interference as noise, the rate of user is given by 

R lk (Y) = logdct (i + H lkk Y lk (V lk ) H H? kk (all + £ II, ; V< iX, \\" 
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We are interested in finding the beamformers V such that the sum of the users' rates are optimized 

K I k 



r v m ? X Z)Z)^( V ) 



fc-1 i-l (4Q) 

s.t. X>(V ifc V£)<n, VfcE/C. 

t=l 

Note that we have included a transmit power constraint for each base station. It has been shown in 



EOl that solving (1401 ) is NP-hard. Therefore, we try to obtain the stationary solution for this problem. 
Furthermore, we can no longer straightforwardly apply the BSUM algorithm that updates Vj fc 's cyclically. 
This is due to the fact that the users in the set ly. share a common power constraint. Thus the requirement 
for the separability of the constraints for different block components in (fT2l is not satisfied. 

To devise an efficient and low complexity algorithm for problem d40l , we will first transform this 
problem to a more suitable form. We first introduce the function fi k (Ui k , V) = logdet (E^ 1 ), where 
Ej fc is the mean square error (MSE) matrix given as 

E ijb 4(i-u?H ijbfc V i J(I-UgH ifcfc V i J s + £ i:"n,^V, V^H&U, • o? fc U£u, . 

In the subsequent presentation we will occasionally use the notation Ej fc (Uj fc , V) to make the dependency 
of the MSE matrix and the transceivers explicit. 

Taking the derivative of fi k (U^ , V) with respect to Uj fc and checking the first order optimality 
condition, we have 

-l 



arg max f ik (V ik , V) = ( of fc I + £ R. ..V, V,"TT." ; 

Plugging in the optimal value of XJi k in fi k (-), we obtain maxu ifc /i fc (Uj fc , V) = Ri k (\). Thus, we can 

rewrite the optimization problem equivalently d40l as^ 

K i k 

^^ lo g det ( E »J 

k=i i=i (41) 

Ik 

i=l 

Notice the fact that the function logdet(-) is a concave function on its argument (see, e.g., (H), then 
for any feasible Ej h , Ej fe , we have 

logdet(E ifc ) < logdet(EiJ + Tr[v Elfc (log det(E ifc ))(E ifc - E ifc )] 

= logdet(Ej +Tr[Er 1 (E ifc - E,J] 4 Uifc (E ifc ,E lfc ) (42) 

2 Such equivalence is in the sense of one-to-one correspondence of both local and global optimal solutions. See 1281 for a 
detailed argument. 
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Utilizing the above transformation and the upper bound, we can again apply the BSUM algorithm. Let 
V and U be two block variables. Define 

K I k 

u y (V, (V, U)) 4 E u ^ ( V ' Ui J, E lfc (V, U ih )) 
fc=i i=i 

u u (U, (V, U))±J2E ^ ( v > )> E ( V > u ^ )) 

fc=l 8=1 

In iteration 2r + 1, the algorithm solves the following problem 

mm u v (V,(V 2r ,U 2r )) 

i» (43) 

i=l 

In iteration 2r + 2, the algorithm solves the following (unconstrained) problem 

min u u (U,(V 2r+1 ,U 2r )) (44) 

The above BSUM algorithm for solving ( |40t is called WMMSE algorithm in the reference |28l . 
Due to (1421) . we must have that 

K I k 

u y (V, (V 2r , U 2r )) > ^ ^ log det(E ifc (V, \jf k )), for all feasible V, V i k 

k=l i=l 
K I k 

u u (U,(\ 2r+1 ,V 2r ))>^^logdet(B lk (\ 2r+ \V lk )) forallU ifc , V i k 

k=l t=l 

Moreover, other conditions in Assumption |2] are also satisfied for u v (-) and u u (-). Thus the convergence 
of the WMMSE algorithm to a stationary solution of problem (|4TT) follows directly from Theorem 12 

We briefly mention here that the main benefit of using the BSUM approach for solving problem (|4TT) 
is that in each step, the problem (l43l can be decomposed into K independent convex subproblems, one 
for each base station k G K,. Moreover, the solutions for these K subproblems can be simply obtained 
in closed form (subject to an efficient bisection search). For more details on this algorithm, we refer the 
readers to l28ll and l25t 

The BSUM approach has been extensively used for resource allocation in wireless networks, for 
example (H, |[T6l , |fT8l , ||23l , ||27l , and |[26l . However, the convergence of most of the algorithms was 
not rigorously established. 
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B. Proximal Minimization Algorithm 

The classical proximal minimization algorithm (see, e.g., ||3] Section 3.4.3]) obtains a solution of the 
problem min xe ^ /(x) by solving an equivalent problem 

min /(x) + — ||x-y|||, (45) 

x£X,y€X 2c 

where /(•) is a convex function, X is a closed convex set, and c > is a scalar parameter. The equivalent 
problem (1431) is attractive in that it is strongly convex in both x and y (but not jointly) so long as f(x) 
is convex. This problem can be solved by performing the following two steps in an alternating fashion 

x r+1 = argmin{/(x) + ^-||x — y^Hl l (46) 

y' + ! = X r+1 . (47) 

Equivalently, let u(x; x r ) 4 /( x ) + i||x - x r ||2, then the iteration ggH53 

can be written as 

x r+1 = argminu(x,x r ). (48) 

n£X 

It can be straightforwardly checked that for all x, x r G X, the function u(x, x r ) serves as an upper 
bound for the function /(x). Moreover, the conditions listed in Assumption Q] are all satisfied. Clearly, the 
iteration (|48T ) corresponds to the SUM algorithm discussed in Section [Hi] Consequently, the convergence 
of the proximal minimization procedure can be obtained from Theorem [TJ 

The proximal minimization algorithm can be generalized in the following way. Consider the problem 

min /(xi, • • • ,x n ) (49) 

X 

s.t. Xj G Xi, i = 1, • • • , n, 

where {Afj}" =1 are closed convex sets, /(•) is convex in each of its block components, but not necessarily 
strictly convex. A straightforward application of the BCD procedure may fail to find a stationary solution 
for this problem, as the per-block subproblems may contain multiple solutions. Alternatively, we can 
consider an alternating proximal minimization algorithm lfT2l . in each iteration of which the following 
subproblem is solved 

min /(x£, . . . ) Xj_ lj Xi,x- +1 , . . . ,x r n ) + — ||xj - x[||| (50) 
s.t. Xj G Xi. 

It is not hard to see that this subproblem always admits a unique solution, as the objective is a strictly 
convex function of Xj. Let u,,(xj,x r ) = f(-x[,--- ,Xj, + jr||xj — x[|||. Again for each x, G Xi 
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and x r G EL/''^?' tne function Uj(x.j,x r ) is an upper bound of the original objective /(x). Moreover, 
all the conditions in Assumption |2] are satisfied. Utilizing Theorem 12 we conclude that the alternating 
proximal minimization algorithm must converge to a stationary solution of the problem ( |49l ). Moreover, 
our result extends those in |[T2l to the case of nonsmooth objective function as well as the case with 
iteration-dependent coefficient c. The latter case, which was also studied in the contemporary work 11321 , 
will be demonstrated in an example for tensor decomposition shortly. 

C. Proximal Splitting Algorithm 

The proximal splitting algorithm (see, e.g., (9j) for nonsmooth optimization is also a special case of 
the BSUM algorithm. Consider the following problem 

min/ 1 (x) + / 2 (x) (51) 

x€/t 

where X is a closed and convex set. Furthermore, f\ is convex and lower semicontinuous; / 2 is convex 
and has Lipschitz continuous gradient, i.e., ||V/2(x) — V/2(y)|| < /3||x — y||, VxjG/f and for some 
P > 0. 

Define the proximity operator proxj. : X — > X as 

prox f (x) = argmin/i(y) + i||x- y|| 2 . (52) 
y&x 2 

The following forward-backward splitting iteration can be used to obtain a solution for problem (TSTT ) [9]: 



x 



r+l 



prox 7/i (x r - 7 V/ 2 (x r )) (53) 



backward step forward step 

where 7 G [e,2//J - e] with e e]0,min{l, l//3}[. Define 

n(x, x r ) 4 /l(x ) + J-||x - xH| 2 + (x - x r , V/ 2 (x r )) + / 2 (x r ). (54) 
2 7 



We first show that the iteration (1531 ) is equivalent to the following iteration 



,r+l 



argminn(x,x r ). (55) 



From the definition of the prox operation, we have 



1 „ _ » / 7- X I I 2 



prox 7/i (x r - 7V/ 2 (x r )) = argmin7/i(x) + -||x - x r + 7V/ 2 (x r )|| 2 

= argmin/i(x) + — ||x - x r || 2 + (x - x r , V/ 2 (x r )) 
xex 27 

= arg min u(x, x r ). 
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We then show that u(x, x r ) is an upper bound of the original function /i(x) + /2(x), for all x, x r G X. 
Note that from the well known Descent Lemma [2, Proposition A.32], we have that 

/ 2 (x) < / 2 (x r ) + f ||x - x'|| 2 + (x - x', V/ 2 (x r )) 

< / 2 (x r ) + J-||x - xl 2 + (x - x f ,V/ 2 (x f )) 
2 7 

where the second inequality is from the definition of 7. This result implies that u(x, y) > /i(x) + 
/2(x), V x,y G A'. Moreover, we can again verify that all the other conditions in Assumption Q] is true. 
Consequently, we conclude that the forward-backward splitting algorithm is a special case of the SUM 
algorithm. 

Similar to the previous example, we can generalize the forward-backward splitting algorithm to the 
problem with multiple block components. Consider the following problem 

n 

min J~] fj (xj) + f n+ i (xi , • • • , x n ) (56) 

i=l 

s.t. Xj G Afj, i = 1, • • • , n 

where {A^}™ =1 are a closed and convex sets. Each function fi(-), i = 1, • • • n is convex and lower 
semicontinuous w.r.t. x^; f n+ \{-) is convex and has Lipschitz continuous gradient w.r.t. each of the 
component x», i.e., ||V/ n+ i(x) - V/ n+ i(y)|| < /3i||x» - yi||, V x^y^ G X h i = 1, • • • , n. Then the 
following block forward-backward splitting algorithm can be shown as a special case of the BSUM 
algorithm, and consequently converges to a stationary solution of the problem (1561 ) 

x[ +1 = prox 7/i (x[ - 7V Xi / n+ i(x r )), i = 1,2, ...,n, 

where 7 G [e*, - e 4 ] with e, g]0, min{l, 

D. CANDECOMP/PARAFAC Decomposition of Tensors 

Another application of the proposed method is in CANDECOMP/PARAFAC (CP) decomposition of 
tensors. Given a tensor X G K™ lXm2X - xm " of order n, the idea of CP decomposition is to write the 
tensor as the sum of rank-one tensors: 

R 

X — ^ ' X r , 

r=l 

where X r = a\ r o a 2 r ■ ■ ■ QW and aj r G M mi . Here the notation " o " denotes the outer product. 
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In general, finding the CP decomposition of a given tensor is NP-hard irTSI . In practice, one of the 
most widely accepted algorithms for computing the CP decomposition of a tensor is the Alternating 
Least Squares (ALS) algorithm ifTTTl . IPT91 , |[29l . The ALS algorithm proposed in (6], |fT3l is in essence a 
BCD method. For ease of presentation, we will present the ALS algorithm only for tensors of order three. 

Let X G y^ixJxK b e a third order tensor. Let (A; B; C) represent the following decomposition 

R 

(A; B; C) = a r o b r o c r , 

r=l 

where a r (resp. b r and c r ) is the r-th column of A (resp. B and C). The ALS algorithm minimizes the 
difference between the original and the reconstructed tensors 

min \\X-(A;B;C)\\, (57) 
A,B,C 

where A G R IxR , B G R JxR , C G R Xxfi , and is the rank of the tensor. 

The ALS approach is a special case of the BCD algorithm in which the three blocks of variables A, B, 
and C are cyclically updated. In each step of the computation when two blocks of variables are held 
fixed, the subproblem becomes the quadratic least squares problem and admits closed form updates (see 

ED). 

One of the well-known drawbacks of the ALS algorithm is the swamp effect where the objective 
value remains almost constant for many iterations before starting to decrease again. Navasca et al. in 
ll22l observed that adding a proximal term in the algorithm could help reducing the swamp effect. More 
specifically, at each iteration r the algorithm proposed in E2l solves the following problem for updating 
the variables: 

\\X - (A; B; C)\\ 2 + X\\A - A r \\ 2 + X\\B - B r \\ 2 + A||C - C r \\ 2 , (58) 

where A G R is a positive constant. As discussed before, this proximal term has been considered in 
different optimization contexts and its convergence has been already showed in lfT2l . An interesting 
numerical observation in ll22l is that decreasing the value of A during the algorithm can noticeably 
improve the convergence of the algorithm. Such iterative decrease of A can be accomplished in a number 
of different ways. Our numerical experiments show that the following simple approach to update A can 
significantly improve the convergence of the ALS algorithm and substantially reduce the swamp effect: 

y^ + Al t™, (59) 
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where A r is the proximal coefficient A at iteration r. Theorem [2] implies the convergence is guaranteed 
even with this update rule of A, whereas the convergence result of lTT2l does not apply in this case since 
the proximal coefficient is changing during the iterations. 

Figure [6] shows the performance of different algorithms for the example given in |[22ll where the 
tensor X is obtained from the decomposition 
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The vertical axis is the value of the objective function where the horizontal axis is the iteration number. 
In this plot, ALS is the classical alternating least squares algorithm. The curve for Constant Proximal 
shows the performance of the BSUM algorithm when we use the objective function in (T58T ) with A = 0.1. 
The curve for Diminishing Proximal shows the performance of block coordinate descent method on d58l ) 
where the weight A decreases iteratively according to d59l) with Ao = 10~ 7 ,Ai = 0.1. The other two 
curves MBI and MISUM correspond to the maximum block improvement algorithm and the MISUM 
algorithm. In the implementation of the MISUM algorithm, the proximal term is of the form in ( |58T ) and 
the weight A is updated based on d59b . 




50 100 150 200 250 300 350 400 

Iterates 



Fig. 6: Convergence of Different Algorithms 

Table U represents the average number of iterations required to get an objective value less than e = 10~ 5 
for different algorithms. The average is taken over 1000 Monte-Carlo runs over different initializations. 
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/WCIdgC I1UII1UC1 Ul ILClaLlUllS 1U1 CUllVCIgCIlCC 


ALS 


277 


Constant Proximal 


140 


Diminishing Proximal 


78 


MB I 


572 


MISUM 


175 



TABLE I: Average number of iterations for convergence 



The initial points are generated randomly where the components of the variables A, B, and C are drawn 
independently from the uniform distribution over the unit interval [0,1]. As it can be seen, adding a 
diminishing proximal term significantly improves the convergence speed of the ALS algorithm. 

E. Expectation Maximization Algorithm 

The expectation maximization algorithm (EM) in |[T0l is an iterative procedure for maximum likelihood 
estimation when some of the random variables are unobserved/hidden. Let w be the observed random 
vector which is used for estimating the value of 9. The maximum likelihood estimate of 9 can be given 
as 

9ml = argmax \n.p(w\9). (60) 

6 

Let the random vector z be the hidden/unobserved variable. The EM algorithm starts from an initial 
estimate 9 and generates a sequence {9 r } by repeating the following steps: 

. E-Step: Calculate g(9, 9 r ) = ^ z \ wfir {hxp{w,z\9)} 

. M-Step: 9 r+1 = argmax e g(9,9 r ) 
The EM-algorithm can be viewed as a special case of SUM algorithm (H. In fact, we are interested in 
solving the following optimization problem 

min — In p(w\9). 
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The objective function could be written as 



]np(w\9) 




= hLp(w, z\9) + E^^^r lnp(z|to, 9 r ) 



where the inequality is due to the Jensen's inequality and the third equality follows from a simple change 
of the order of integration for the expectation. Since ~E z \ w ^ r \np(z\w,9 r ) is not a function of 9, the 
M-step in the EM-algorithm can be written as 



Furthermore, it is not hard to see that u(9 r ,9 r ) = — \np(w\9 r ). Therefore, under the smoothness 
assumption, Proposition Q] implies that Assumption Q] is satisfied. As an immediate consequence, the EM- 
algorithm is a special case of the SUM algorithm. Therefore, our result implies not only the convergence 
of the EM-algorithm, but also the convergence of the EM-algorithm with Gauss-Seidel/coordinatewise 
update rule (under the assumptions of Theorem [2]). In fact in the block coordinate EM-algorithm (BEM), 
at each M-step, only one block is updated. More specifically, let 9 = (9\, . . . ,9 n ) be the unknown 
parameter. Assume w is the observed vector and z is the hidden/unobserved variable as before. The 
BEM algorithm starts from an initial point 9° = (9®, . . . , 9®) and generates a sequence {9 r } according 
to the algorithm in Figure [7] 

The motivation behind using the BEM algorithm instead of the EM algorithm could be the difficulties 
in solving the M-step of EM for the entire set of variables, while solving the same problem per block of 
variables is easy. To the best of our knowledge, the BEM algorithm and its convergence behavior have 
not been analyzed before. 

F. Concave-Convex Procedure/Difference of Convex Functions 

A popular algorithm for solving unconstrained problems, which also belongs to the class of successive 
upper-bound minimization, is the Concave-Convex Procedure (CCCP) introduced in ||33l . In CCCP, also 



argmaxu(#, 9 r ). 
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1 Initialize with 9® and set v = 






— 1 C|JC<1L 






3 r = r + 1, i = r mod n + 1 






4 E-Stetr off?- 6> r ) — E i dlnnf?;; zlf?T 




/9 r )T 

• ) u n)S 


5 M-Step: cT +1 = argmax^ gi(9i,6 r ) 






6 until some convergence criterion is met 







Fig. 7: Pseudo code of the BEM algorithm 



known as the difference of convex functions (DC) programming, we consider the unconstrained problem 

min fix), 

where f{x) = fcve(%) + / cra (i),V x € M m ; where f C ve{-) is a concave function and f cvx {-) is convex. 
The CCCP generates a sequence {x r } by solving the following equation: 

^ fcvx{% ) — V ' fcveipC )> 

which is equivalent to 

£ r+1 = argmin g(x,x r ), (61) 

where x r ) = / cra; (x) + (:E— £ r ) T V/ c?J e(x r ")+,/ CT , e (x r ). Clearly, x r ) is a tight convex upper-bound 
of f(x) and hence CCCP is a special case of the SUM algorithm and its convergence is guaranteed by 
Theorem Q] under certain assumptions. Furthermore, if the updates are done in a block coordinate manner, 
the algorithm becomes a special case of BSUM whose convergence is guaranteed by Theorem [2] To the 
best of our knowledge, the block coordinate version of CCCP algorithm and its convergence has not been 
studied before. 
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